The American Housing Survey

The American Housing Survey is a regular survey conducted by the US Census in coordination with the US Department of Housing and Urban Development that asks questions about housing and affordability. Here is what the US Census says about the AHS:

The American Housing Survey (AHS) is sponsored by the Department of Housing and Urban Development (HUD) and conducted by the U.S. Census Bureau. The survey has been the most comprehensive national housing survey in the United States since its inception in 1973, providing current information on the size, composition, and quality of the nation’s housing and measuring changes in our housing stock as it ages. The AHS is a longitudinal housing unit survey conducted biennially in odd-numbered years, with samples redrawn in 1985 and 2015 .

The survey provides up-to-date information about the quality and cost of housing in the United States and major metropolitan areas. The survey also includes questions about:

  • the physical condition of homes and neighborhoods,
  • the costs of financing and maintaining homes, and
  • the characteristics of people who live in these homes.
  • Planners, policy makers, and community stakeholders use the results of the AHS to assess the housing needs of communities and the country. These statistics inform decisions that affect the housing opportunities for people of all income levels, ages, and racial and ethnic groups.

Since our country changes rapidly, policymakers in government and private organizations need current housing information to make decisions about programs that will affect people of all income levels, ages, and racial and ethnic groups.

In this project, we will use the 2021 Public Use Files. These files strip identifying information from the respondents, including location information such as the state or city. The data are also broken up into four individual files.

You should become familiar with the codebook for the version of the survey we are using. When you find variables, you can expand the description by pressing the + button and then clicking on the “Survey Year: 2021 National” option. Here is an example describing the variable that serves as the primary key for each table (each table has its own control number, one per row).

Looking up CONTROL variable
Looking up CONTROL variable

Loading data

Load each of the files listed above into a table (I suggest using household, project, etc., and further tasks will mention these tables by name). While loading the data, pay particular attention to the following:

is_num<-function(x){
  all(str_detect(x,"(^'[0-9])|('-)"))
}

strip_and_convert <- function(x) {
  d<-(gsub("(^')|('$)", "", x))
  as.numeric(d)
}

is_na <- function(x) {
  ifelse((x==-9|x==-6),NA,x)
}

import<-function(file){
  read_csv(file=file,col_select =matches("^JOB|^[^J]")) %>% 
    select(-matches("REPW")) %>% 
    mutate_if(is_num,strip_and_convert) %>% mutate_if(is.numeric,is_na)
}

project <- import('../data/ahs_2021/project.csv.gz')
mortgage <- import("../data/ahs_2021/mortgage.csv.gz")
household <- import("../data/ahs_2021/household.csv.gz")
person <- import("../data/ahs_2021/person.csv.gz")
dim(project)
## [1] 51476     8
dim(mortgage)
## [1] 19155    19
dim(household)
## [1] 64141   305
dim(person)
## [1] 135926     49

Recoded data:

household<-household %>% mutate(BLD=recode(BLD,
  '0'='Mobile home or trailer',
  '1'='One-family house, detached',
  '2'='One-family house, attached',
  '3'='Mobile home or trailer',
  '4'='2 apartments',
  '5'='3 to 4 apartments',
  '6'='5 to 9 apartments',
  '7'='10 to 19 apartments',
  '8'='20 to 49 apartments',
  '9'='50 or more apartments',
  '10'='Boat, RV, van, etc.'
),
BATHROOMS=recode(BATHROOMS,
  '1'='One full bathroom',
  '2'='1.5 bathrooms',
  '3'='2 bathrooms',
  '4'= '2.5 bathrooms',
  '5'= '3 bathrooms',
  '6'= 'More than 3 bathrooms',
  '7'= 'No full bath: sink and tub present',
  '8'= 'No full bath: sink and toilet present',
  '9'= 'No full bath: tub and toilet present',
  '10'= 'No full bath: sink only',
  '11'= 'No full bath: tub only',
  '12'= 'No full bath: toilet only',
  '13'= 'No full bath: no sink, tub or toilet'
),
DIVISION=recode(DIVISION,
'1' ='New England',
'2' = 'Middle Atlantic',
'3' = 'East North Central',
'4' = 'West North Central',
'5' = 'South Atlantic',
'6' = 'East South Central',
'7' = 'West South Central',
'8' = 'Mountain',
'9' = 'Pacific'
),
HOA=ifelse(HOA==1,'true','false')
)
person<-person %>% 
  mutate(
  RACE=recode(RACE,
  '1'="White only",'2'='Black only',
  '3'='American Indian, Alaska Native only',
  '4'='Asian only',
  '5'='Hawaiian, Pacific Islander only',
  '6'='White / Black','7'='White / American Indian, Alaska Native',
  '8'='White / Asian',
  '9'='White / Hawaiian, Pacific Islander',
  '10'='Black / American Indian, Alaska Native',
  '11'='Black / Asian',
  '12'='Black / Hawaiian, Pacific Islander',
  '13'='American Indian, Alaska Native / Asian',
  '14'='Asian / Hawaiian, Pacific Islander',
  '15'='White / Black / American Indian, Alaska Native',
  '16'='White / Black / Asian',
  '17'='White / American Indian, Alaska Native / Asian',
  '18'= 'White / Asian / Hawaiian, Pacific Islander',
  '19'= 'White / Black / American Indian, Alaska Native / Asian',
  '20'= 'Other combinations of 2 or 3 races',
  '21'= 'Other combinations of 4 or more races'
))
mortgage<-mortgage %>% 
  mutate(LOANTYPE=recode(LOANTYPE,
  '1'='Primary loan','2'='Secondary loan','3'='Home equity line of credit'
))

Exploring Household Data

Required

  • Provide a plot that shows the number of households in each DIVISION
  • Provide a plot of the marginal distribution of YRBUILT (hint: what kind of variable is this? See the codebook.)
  • Is the market value (MARKETVAL) typically higher for households that have a homeowner’s association (HOA)?
  • Create a column that replaces UNITSIZE values with the midpoint of the range. How does the number of bedrooms change with larger homes?
  1. the number of households in each DIVISION.
household %>% 
  ggplot(aes(y=factor(DIVISION,levels = names(sort(table(DIVISION))))))+
  geom_bar(fill=4,col=1)+labs(title = "Regional quantity distribution",
                              y="DIVISION")

  1. The codebook says that for public reasons, YRBUILT is a categorical value, a time variable; The year of construction of the unit can be determined using multiple projects. It’s not continuous. The value could be a definite year or it could be a rough date because someone can’t remember it
ggplot(household,aes(x=YRBUILT))+
  geom_histogram(fill=4,col=1)

#Use categorical variables to remove the interval
ggplot(household,aes(y=factor(YRBUILT)))+
  geom_bar(fill=4,col=1)

  1. Yes, homes with a Homeowners Association (HOA) generally have a higher MARKETVAL. The distribution is similar and the value shifts upward, but the difference is not very significant.
household %>% filter(!is.na(HOA))%>%group_by(HOA) %>%
  ggplot(aes(y = MARKETVAL, x = HOA)) +
  stat_summary(
    fun.min = min,
    fun.max = max,
    fun = median
  )

household %>% filter(!is.na(HOA))%>%group_by(HOA) %>% summarise(mean=mean(MARKETVAL,na.rm = T),sd=sd(MARKETVAL,na.rm = T))
## # A tibble: 2 × 3
##   HOA      mean      sd
##   <chr>   <dbl>   <dbl>
## 1 false 418583. 548318.
## 2 true  497148. 556395.
household %>% filter(!is.na(HOA)) %>% ggplot(aes(y=MARKETVAL,x=HOA,fill=HOA))+geom_violin()

household %>% filter(!is.na(HOA)) %>%ggplot(aes(y=MARKETVAL,x=HOA,fill=HOA))+geom_boxplot(width=.3,position=position_dodge(width=0.9))+ylim(0,2*10^6)

  1. The variables are ordered categorical variables of numerical transformation, the points are relatively concentrated, and the positions of the points are shown on the diagram. And in general, the bigger the house, the more bedrooms it has.The jitter plot can disperse the points that converge.
household2<- household %>% 
  mutate(UNITSIZE_midpoint=recode(UNITSIZE,
  '1'=250,
  '2'=625,
  '3'=875,
  '4'=1250,
  '5'=1750,
  '6'=2250,
  '7'=2750,
  '8'=3500,
  '9'=4000
)) 
household2%>% ggplot(aes(y=BEDROOMS,x=UNITSIZE_midpoint)) +geom_point(position = "jitter",alpha=0.4,size=0.05,col=1)+geom_smooth(method = "lm",col='red')

household2%>% filter(!is.na(UNITSIZE_midpoint))%>% ggplot(aes(y=factor(BEDROOMS),x=factor(UNITSIZE_midpoint))) +geom_bin_2d()+scale_fill_gradient(low = "lightblue",high = "red")

Open Ended

  • Use group_by and summarize to investigate a variable not used as a grouping factor in the required sections above. Write up your findings in written form (3 to 5 sentences).
  • If you were looking for an affordable house, where would you choose to live? (Which region as coded in the DIVISION column?)? Define how you will define “affordable” and explain how will you choose to select a region using your measurement of affordability. Implement affordability using mutate. Compare the regions and explain your results.
  • Select two numeric variables not used above and investigate the relationship between the two. Use both graphical and numeric summaries. Write up your findings in a short paragraph.
  1. About the bathroom condition and the value of the house,
  • No full bath accounted for a small percentage, which could be due to financial difficulties, among other reasons
  • Most houses have one or more fully equipped bathrooms
  • Generally speaking, the more bathrooms there are, the better the facilities are, and the house value is relatively high.
household %>% group_by(BATHROOMS) %>% summarise(n=n()) %>% arrange(-n) %>%  mutate(percentage =paste0(round(n/sum(n)*100,3),"%"))
## # A tibble: 11 × 3
##    BATHROOMS                                 n percentage
##    <chr>                                 <int> <chr>     
##  1 One full bathroom                     23872 37.218%   
##  2 2 bathrooms                           18235 28.43%    
##  3 2.5 bathrooms                          7958 12.407%   
##  4 1.5 bathrooms                          6426 10.019%   
##  5 3 bathrooms                            6008 9.367%    
##  6 More than 3 bathrooms                  1495 2.331%    
##  7 No full bath: no sink, tub or toilet    129 0.201%    
##  8 No full bath: sink and toilet present     9 0.014%    
##  9 No full bath: toilet only                 5 0.008%    
## 10 No full bath: tub and toilet present      3 0.005%    
## 11 No full bath: sink and tub present        1 0.002%
household %>% group_by(BATHROOMS) %>% summarise(u=mean(MARKETVAL,na.rm=T)) %>% ggplot(aes(x=u,y=BATHROOMS))+geom_point()+geom_line()

household %>%ggplot(aes(x=MARKETVAL,y=BATHROOMS))+geom_violin(fill=4)

  • “TOTHCAMT” is the total monthly housing cost, and “FINCP” is the total household income. Choose housing expenses as a percentage of median income to measure the height of house prices, rather than the amount, because price levels can vary between regions,The median is more representative of the regional average, removing the effect of the large gap between rich and poor
  • The median household income and median housing expenditure of each region are selected to reflect the level of each region
  • West North Central may be a place for affordable housing.The median expend is 879.0 and income is 48000.
household %>%group_by(DIVISION) %>% summarise(expend=median(TOTHCAMT,na.rm = T),income=median(FINCP,na.rm = T)) %>% mutate(Disposable_income=income-expend,p=expend/income) %>% arrange(p)
## # A tibble: 9 × 5
##   DIVISION           expend income Disposable_income      p
##   <chr>               <dbl>  <dbl>             <dbl>  <dbl>
## 1 West North Central   879   48000            47121  0.0183
## 2 East North Central   958   48600            47642  0.0197
## 3 Mountain            1200.  55000            53800. 0.0218
## 4 West South Central  1094   50000            48906  0.0219
## 5 East South Central   680   30700            30020  0.0221
## 6 South Atlantic      1218   52000            50782  0.0234
## 7 Pacific             1656   66950            65294  0.0247
## 8 Middle Atlantic     1181   46780            45599  0.0252
## 9 New England         1530   60000            58470  0.0255
  • “ELECAMT” is the monthly electricity charge. “GASAMT” is the monthly gas charge. The 0-4 numbers are removed because they represent vacant houses or free houses or those that have been paid for. The scatter plot shows that households that use more electricity generally use more water. But most people are at a low level
household %>% filter(!ELECAMT%in%c(0:4)&!GASAMT%in%c(0:4))%>% ggplot(aes(y=ELECAMT,x=GASAMT ))+geom_point(size=0.05,col=2)+geom_smooth()

Exploring Person Data

Required

  • Investigate the PERSONID variable. How is variable different from CONTROL? If you want to find out how many people live in a household, how can you do that? What is the maximum number of people living in a household in this data set?
  • For each household, find the minimum age, median age, and maximum age. Summarize the average of these three values. (Save the table of min, median, and max age for later).
  1. CONTROL represents the head of the household, and each head may have multiple members of the PERSONID living in a room. The largest family was 19 people
person %>% group_by(PERSONID)%>% summarise(n=n())
## # A tibble: 135,926 × 2
##       PERSONID     n
##          <dbl> <int>
##  1 11000005001     1
##  2 11000005002     1
##  3 11000007005     1
##  4 11000007006     1
##  5 11000007007     1
##  6 11000010002     1
##  7 11000010003     1
##  8 11000013001     1
##  9 11000013002     1
## 10 11000014003     1
## # ℹ 135,916 more rows
person %>% group_by(CONTROL)%>% summarise(n=n()) %>% arrange(-n)
## # A tibble: 56,058 × 2
##     CONTROL     n
##       <dbl> <int>
##  1 11025555    19
##  2 11046927    16
##  3 11037862    13
##  4 11072988    13
##  5 11082642    13
##  6 11025451    12
##  7 11025524    12
##  8 11068752    12
##  9 11075513    12
## 10 11084358    12
## # ℹ 56,048 more rows
age<-person %>% group_by(CONTROL) %>% summarise(NUM=n(),min_age=min(AGE),median_age=median(AGE),max_age=max(AGE),mean=(min_age+median_age+min_age)/3)
age
## # A tibble: 56,058 × 6
##     CONTROL   NUM min_age median_age max_age  mean
##       <dbl> <int>   <dbl>      <dbl>   <dbl> <dbl>
##  1 11000005     2      73       73.5      74  73.2
##  2 11000007     3      22       55        55  33  
##  3 11000010     2      55       67.5      80  59.2
##  4 11000013     2      60       60.5      61  60.2
##  5 11000014     3      10       38        52  19.3
##  6 11000017     3      13       43        49  23  
##  7 11000018     1      21       21        21  21  
##  8 11000019     1      80       80        80  80  
##  9 11000021     6       4       16        38   8  
## 10 11000022     1      70       70        70  70  
## # ℹ 56,048 more rows

Open Ended

  • The person table contains several categorical variables. Using grouping on more than variable to explore the relationships between these variables. For example, for each level of variable A, what is the most common value of variable B? Write up what you find.

  • MLPE served in the Vietnam War, most people did not answer whether they participated in the Vietnam War, should be not understand

  • The percentage of Hispanics who did not serve in the Vietnam War was higher than the percentage of Hispanics who served in the Vietnam War

person %>% mutate(MLPE=recode(MLPE,"1"='yes',"2"='no'),
                  SPAN=recode(SPAN,"1"='yes',"2"='no'))%>%  
  group_by(MLPE) %>% count(SPAN)%>% 
mutate(p=paste0(round(n/sum(n)*100,2),"%"))%>% arrange(-n,.by_group=T)
## # A tibble: 6 × 4
## # Groups:   MLPE [3]
##   MLPE  SPAN       n p     
##   <chr> <chr>  <int> <chr> 
## 1 no    no      4230 91.42%
## 2 no    yes      397 8.58% 
## 3 yes   no      2363 95.01%
## 4 yes   yes      124 4.99% 
## 5 <NA>  no    100230 77.81%
## 6 <NA>  yes    28582 22.19%

Exploring Mortgage Data

Required

  • Can a single property have more than one mortgage?
  • Plot the marginal distribution of PMTAMT and interpret the results.
  • Mortgages (loans using property as collateral) can be broken down into primary (has first opportunity to recoup costs in case of a default), secondary (only recoups costs after the primary is paid), and home equity (a type of secondary where money is loaned based on the difference between the outstanding mortgage amount and the assessed value of the property, the “equity”). The AMMORT contains information on the amount of the mortgage for primary and secondary types. With home equity lines of credit (HELOC) loans, home owners have a credit limit that they could borrow at any time, which is contained in the HELOCLIM column. Create a new column that merges AMMROT and HELOCLIM to give a picture how much money has or could be borrowed. Plot the conditional distribution of this new column for each of the LOANTYPE categories. Compare both typical values and variation. Write up a 2 or 3 sentences explaining your results.

there are more than one mortgage,like Primary loan and Secondary loan.And the families shown below have three loans

mortgage %>%count(CONTROL) %>% arrange(-n) %>% head(5)
## # A tibble: 5 × 2
##    CONTROL     n
##      <dbl> <int>
## 1 11005033     3
## 2 11005360     3
## 3 11015764     3
## 4 11023551     3
## 5 11032795     3
  • “PMTAMT” is a right-skewed distribution, with data concentrated on the left and some people with high levels of mortgages. Because most people choose a lower amount when they apply for a mortgage.
mortgage %>% ggplot(aes(PMTAMT))+geom_histogram(bins=100,fill=2,col=1)+xlim(0,15000)

  • we can compare the median values for each loan type,Primary loan is higher, as well as the interquartile ranges and there are a lot of outliers.
mortgage<-mortgage %>% mutate(both_amount=ifelse(LOANTYPE%in%c('Primary loan','Secondary loan'),AMMORT,HELOCLIM))
mortgage %>% ggplot(aes(x=both_amount,y=LOANTYPE))+geom_boxplot()

mortgage %>% ggplot(aes(x=both_amount,y=LOANTYPE))+geom_boxplot()+xlim(0,2.5*10^6)

Open Ended

  • Compare refinanced mortgages to non-refinanced mortgages (REFI). Include at least one plot.
  • Demonstrate the use of the filter function to limit attention to properties with more than one mortgage. For these properties, compare the different types of mortgages. For which types of loans do people carry the highest debt?

the loan amount for refinancing was higher on mean and median.

mortgage %>% filter(!is.na(REFI))%>% mutate(REFI=recode(REFI,'1'="yes",'2'="no"))%>%  ggplot(aes(y=both_amount,x=REFI,fill=REFI))+geom_boxplot()+ylim(0,2.5*10^6)

mortgage %>% filter(!is.na(REFI))%>% mutate(REFI=recode(REFI,'1'="yes",'2'="no")) %>% group_by(REFI) %>% summarise(u=mean(both_amount),median=median(both_amount,na.rm = T))
## # A tibble: 2 × 3
##   REFI        u  median
##   <chr>   <dbl>   <dbl>
## 1 no    244919. 192902.
## 2 yes   270502. 215351

Primary loan types of loans that people carry the highest debt.

mortgage %>%group_by(CONTROL) %>% mutate(n=n()) %>% ungroup()%>% filter(n==1) %>% group_by(LOANTYPE) %>% summarise(mean=mean(both_amount,na.rm = T))
## # A tibble: 2 × 2
##   LOANTYPE                      mean
##   <chr>                        <dbl>
## 1 Home equity line of credit 151052.
## 2 Primary loan               257795.

Using EDA Before Merging

We have now investigated the three main tables individually. Use these results to help you formulate three questions that we can ask after merging the data in the next section. For each question, be sure to state why the previous results caused you to ask this question.

  1. The distribution of variables mainly analyzed above includes TOTHCAMT, MARKETVAL, and FINCP. we consider the condition distribution of different races to see if there is racial discrimination and other situations.Which race may have higher home values or household incomes

  2. Because we looked at the level of housing consumption in different regions, and further considered the influence of race, which may affect normal life, we looked at whether there were differences in the number of major ethnic groups in different regions, their level of rent spending, and what is the total household income

  3. Since we have studied the relationship between the number of bedrooms and the size of the house, different groups of people have different life preferences, and the race added to the person table is considered to study the preferences of different groups of people.Which races prefer to live in groups.

Merging Data

Merging Households and Mortgages

Write a paragraph describing the difference between these ways of merging the household and mortgages table:

  • Use a left join on households (left) and mortgages (right) on the CONTROL column
  • Use an inner join on households (left) and mortgages (right) on the CONTROL column
  • First aggregate the mortgages table to get total mortgage amounts and payments, then just a left join households to the aggregated mortgages table.

If we would want to compare mortgage amounts for primary, second, and HELOC loans for each region (DIVISON column)? Implement this solution and use a facet plot to show the distributions of primary, secondary, and HELOC loans by region.

  • A left join is used to match one mortgage per line of the household data.According to each symbol of the left table matches the corresponding right table, the number of rows in the 1 pair is increased, and the number of rows is retained when there is no match

  • An inner join is employed to retain rows that have a match, removing any rows that do not have a match. When there is a one-to-many relationship, the join operation will add rows accordingly.

  • The mortgages are first consolidated into a single loan column, allowing the CONTROL identifier to uniquely correspond to each household line. Each mortgage row is matched during the merge, resulting in a table that represents the head of the household with a mortgage

mortgage %>% group_by(CONTROL) %>%
  left_join(household,mortgage,by='CONTROL')
## # A tibble: 19,155 × 324
## # Groups:   CONTROL [18,108]
##     CONTROL AMMORT  REFI MORTCLASS MORTTYPE TAXPMT PMTONLY MORTLINE REFICSHAMT
##       <dbl>  <dbl> <dbl>     <dbl>    <dbl>  <dbl>   <dbl>    <dbl>      <dbl>
##  1 11000007 250364     2         1        1      1    1629        1         NA
##  2 11000017 152441     2         1        1      1     753        1         NA
##  3 11000017 203067     2         1        1      2    1073        2         NA
##  4 11000042 383327     1         1        1      1    1395        1         NA
##  5 11000048 128553     2         1        1      1    1322        1         NA
##  6 11000055 198463     1         1        1      1    1137        1         NA
##  7 11000062  33046     2         1        2      2     178        1         NA
##  8 11000078 165332     2         1        1      1     731        1         NA
##  9 11000078     NA    NA         3       NA     NA      NA        2         NA
## 10 11000079  26324     1         1        1      1     119        1         NA
## # ℹ 19,145 more rows
## # ℹ 315 more variables: MORTADDTN <dbl>, HELOCLIM <dbl>, REFICSH <dbl>,
## #   HELOCBAL <dbl>, HELOCADD <dbl>, PMTFREQ <dbl>, LOANTYPE <chr>,
## #   MISCPMT <dbl>, PMTAMT <dbl>, INTRATE <dbl>, both_amount <dbl>,
## #   TOTROOMS <dbl>, PERPOVLVL <dbl>, DBEVICLK <dbl>, DBEVICNOTE <dbl>,
## #   DBEVICTHT <dbl>, DBEVICWHERE <dbl>, DBFORCWHR <dbl>, DBLVEFORC <dbl>,
## #   PETSCAT <dbl>, PETSDOG <dbl>, PETSBIRD <dbl>, PETSFISH <dbl>, …
mortgage %>% group_by(CONTROL) %>%
  inner_join(household,mortgage,by='CONTROL')
## # A tibble: 19,155 × 324
## # Groups:   CONTROL [18,108]
##     CONTROL AMMORT  REFI MORTCLASS MORTTYPE TAXPMT PMTONLY MORTLINE REFICSHAMT
##       <dbl>  <dbl> <dbl>     <dbl>    <dbl>  <dbl>   <dbl>    <dbl>      <dbl>
##  1 11000007 250364     2         1        1      1    1629        1         NA
##  2 11000017 152441     2         1        1      1     753        1         NA
##  3 11000017 203067     2         1        1      2    1073        2         NA
##  4 11000042 383327     1         1        1      1    1395        1         NA
##  5 11000048 128553     2         1        1      1    1322        1         NA
##  6 11000055 198463     1         1        1      1    1137        1         NA
##  7 11000062  33046     2         1        2      2     178        1         NA
##  8 11000078 165332     2         1        1      1     731        1         NA
##  9 11000078     NA    NA         3       NA     NA      NA        2         NA
## 10 11000079  26324     1         1        1      1     119        1         NA
## # ℹ 19,145 more rows
## # ℹ 315 more variables: MORTADDTN <dbl>, HELOCLIM <dbl>, REFICSH <dbl>,
## #   HELOCBAL <dbl>, HELOCADD <dbl>, PMTFREQ <dbl>, LOANTYPE <chr>,
## #   MISCPMT <dbl>, PMTAMT <dbl>, INTRATE <dbl>, both_amount <dbl>,
## #   TOTROOMS <dbl>, PERPOVLVL <dbl>, DBEVICLK <dbl>, DBEVICNOTE <dbl>,
## #   DBEVICTHT <dbl>, DBEVICWHERE <dbl>, DBFORCWHR <dbl>, DBLVEFORC <dbl>,
## #   PETSCAT <dbl>, PETSDOG <dbl>, PETSBIRD <dbl>, PETSFISH <dbl>, …
tab<-mortgage %>% group_by(CONTROL) %>% 
  mutate(sum=sum(both_amount))%>% ungroup()%>% 
  left_join(household,by='CONTROL')
mortgage %>% left_join(household,by='CONTROL') %>%
  ggplot(aes(x=both_amount,y=DIVISION))+geom_violin()+
  facet_wrap(~LOANTYPE,ncol=1,scales = 'free')

Required

  • Using the merge in the previous step, plot the household income FINCP against the total mortgage amount. Comment on the results.
  • For this step, we will use the column we created above that merged AMMORT and HELOCLIM. Suppose this column is called both_amount. Using pivoting, create columns for the both_amount_primary, both_amount_secondary, and both_amount_heloc. For mortgages with both primary and HELOC mortgages, plot the joint distribution of these values. You will probably need to group by CONTROL after pivoting to get totals.

The fitting line shows a positive relationship

tab %>% ggplot(aes(FINCP,sum))+
  geom_point()+
  geom_smooth()

tab %>% 
  pivot_wider(names_from =LOANTYPE,
              values_from = both_amount)%>% 
  group_by(CONTROL) %>% 
  summarise(both_amount_primary=sum(`Primary loan`,na.rm = T),
            both_amount_secondary=sum(`Secondary loan`,na.rm = T),
            both_amount_heloc=sum(`Home equity line of credit`,na.rm = T))
## # A tibble: 18,108 × 4
##     CONTROL both_amount_primary both_amount_secondary both_amount_heloc
##       <dbl>               <dbl>                 <dbl>             <dbl>
##  1 11000007              250364                     0                 0
##  2 11000017              152441                203067                 0
##  3 11000042              383327                     0                 0
##  4 11000048              128553                     0                 0
##  5 11000055              198463                     0                 0
##  6 11000062               33046                     0                 0
##  7 11000078              165332                     0                 0
##  8 11000079               26324                     0                 0
##  9 11000082              170943                     0                 0
## 10 11000088              260840                     0                 0
## # ℹ 18,098 more rows

Open Ended

  • Select another merge to make (household to person or person to mortgage, in any order).
  • Address one or more of your EDA questions from the previous sessions. Provide at least one plot and one use of either group_by/summarize or an iteration technique to address your question.
  • Did you find an answer to your question or did it prompt new questions? Write up a short paragraph of your results.
left_join(household,person,by='CONTROL')
## # A tibble: 144,009 × 353
##     CONTROL TOTROOMS PERPOVLVL DBEVICLK DBEVICNOTE DBEVICTHT DBEVICWHERE
##       <dbl>    <dbl>     <dbl>    <dbl>      <dbl>     <dbl>       <dbl>
##  1 11000005        8       465       NA         NA        NA          NA
##  2 11000005        8       465       NA         NA        NA          NA
##  3 11000007        8       501       NA         NA        NA          NA
##  4 11000007        8       501       NA         NA        NA          NA
##  5 11000007        8       501       NA         NA        NA          NA
##  6 11000009        5        NA       NA         NA        NA          NA
##  7 11000010        5        12        3         NA         2           1
##  8 11000010        5        12        3         NA         2           1
##  9 11000013        7       501       NA         NA        NA          NA
## 10 11000013        7       501       NA         NA        NA          NA
## # ℹ 143,999 more rows
## # ℹ 346 more variables: DBFORCWHR <dbl>, DBLVEFORC <dbl>, PETSCAT <dbl>,
## #   PETSDOG <dbl>, PETSBIRD <dbl>, PETSFISH <dbl>, PETSHH <dbl>,
## #   PETSREPT <dbl>, PETSSMAM <dbl>, RENT <dbl>, WFRALERT <dbl>, WFRROOF <dbl>,
## #   WFRSIDING <dbl>, WFRWFENCE <dbl>, WFRWPILE <dbl>, WFRWSHED <dbl>,
## #   BATHEXCLU <dbl>, DISHWASH <dbl>, SOLAR <dbl>, GARAGE <dbl>, DINING <dbl>,
## #   LAUNDY <dbl>, NOSTEP <dbl>, GUTREHB <dbl>, CONDO <dbl>, SEARCHCRED <dbl>, …
  • American Indian, Alaska Native / Asian usually owns a higher value house,We can compare the distributions according to the graph
left_join(household,person,by='CONTROL') %>% group_by(RACE) %>% summarise(TOTHCAMT1=median(TOTHCAMT),MARKETVAL2=median(MARKETVAL,na.rm=T),FINCP3=median(FINCP)) %>% arrange(-MARKETVAL2)
## # A tibble: 21 × 4
##    RACE                                           TOTHCAMT1 MARKETVAL2 FINCP3
##    <chr>                                              <dbl>      <dbl>  <dbl>
##  1 American Indian, Alaska Native / Asian             1967     931263   83010
##  2 Black / Hawaiian, Pacific Islander                 1344.    758329   36850
##  3 Asian / Hawaiian, Pacific Islander                 1909     668180  122500
##  4 Asian only                                         1955     556104  100000
##  5 White / Asian / Hawaiian, Pacific Islander         1741     548172   89200
##  6 White / Asian                                      2138     514670. 125000
##  7 Other combinations of 2 or 3 races                 1600     504230.  47000
##  8 White / American Indian, Alaska Native / Asian     2293     487500   86000
##  9 White / Black / Asian                              1907     460844  139000
## 10 White / Hawaiian, Pacific Islander                 1834     451128  103800
## # ℹ 11 more rows
left_join(household,person,by='CONTROL') %>% pivot_longer(col=c(TOTHCAMT,MARKETVAL,FINCP),names_to = "vv") %>% ggplot(aes(value,RACE,fill=vv))+geom_violin()+facet_grid(~vv,scales = "free")+theme(legend.position = 'none')

  • In the Pacific,Asian only has a higher housing expenditure than Black only. Asian only generally has a higher housing expenditure, while Black only has the lowest.
left_join(household,person,by='CONTROL') %>%
  filter(!is.na(RACE)) %>% 
  group_by(DIVISION,RACE) %>% summarise(n=n(),mTOTHCAMT=median(TOTHCAMT,na.rm=T),mFINCP=median(FINCP,na.rm=T)) %>% group_by(DIVISION) %>% 
  arrange(-n,.by_group = T) %>% 
  slice_head(n=3)
## # A tibble: 27 × 5
## # Groups:   DIVISION [9]
##    DIVISION           RACE           n mTOTHCAMT mFINCP
##    <chr>              <chr>      <int>     <dbl>  <dbl>
##  1 East North Central White only 13325      1100  69600
##  2 East North Central Black only  2963       940  30000
##  3 East North Central Asian only   848      1470  89002
##  4 East South Central White only  3765       820  50000
##  5 East South Central Black only  1665       749  22100
##  6 East South Central Asian only    81      1539  77050
##  7 Middle Atlantic    White only  9935      1440  70000
##  8 Middle Atlantic    Black only  3021      1065  32000
##  9 Middle Atlantic    Asian only  1082      1807  89700
## 10 Mountain           White only  8063      1354  69000
## # ℹ 17 more rows
  • Black only people are more likely to live in houses with fewer square footage and bedrooms, while Asian and white people are more likely to live in houses with larger Spaces, which may be safer during the pandemic.
household2 %>% left_join(person,by='CONTROL') %>% group_by(RACE) %>% summarise(n=n(),BEDROOMS=mean(BEDROOMS,na.rm = T),UNITSIZE=mean(UNITSIZE_midpoint,na.rm = T)) %>% arrange(-n)
## # A tibble: 21 × 4
##    RACE                                       n BEDROOMS UNITSIZE
##    <chr>                                  <int>    <dbl>    <dbl>
##  1 White only                             97395     3.02    1853.
##  2 Black only                             22358     2.79    1546.
##  3 Asian only                             10491     3.13    1995.
##  4 <NA>                                    8083     2.28    1371.
##  5 American Indian, Alaska Native only     1809     2.87    1552.
##  6 White / Black                           1103     2.98    1587.
##  7 White / American Indian, Alaska Native   723     2.85    1621.
##  8 White / Asian                            717     3.27    2053.
##  9 Hawaiian, Pacific Islander only          683     3.01    1751.
## 10 Black / American Indian, Alaska Native   183     2.76    1605.
## # ℹ 11 more rows
household2 %>% left_join(person,by='CONTROL') %>%filter(RACE%in%c('White only','Black only','Asian only')) %>%  ggplot(aes(y=BEDROOMS,x=UNITSIZE_midpoint,col=RACE)) +geom_point(position = "jitter",alpha=0.6,size=0.5)+geom_smooth(method = "lm",col='red')